Abstract—The problem of prediction consists in forecasting 
the conditional distribution of the next outcome given the past. 
Assume that the source generating the data is such that there 
is a stationary ergodic predictor whose error converges to zero 
(in a certain sense). The question is whether there is a universal 
predictor for all such sources, that is, a predictor whose error 
goes to zero if any of the sources that have this property is chosen 
to generate the data. This question is answered in the negative, 
contrasting a number of previously established positive results 
concerning related but smaller sets of processes. 

I. Introduction 

The basic problem is predicting the conditional probability 
distribution $\rho(\cdot \mid x_1, \dots, x_n)$ over the next outcome $x_{n+1}$ given 
a sequence of observations $x_1, \dots, x_n$ generated by an unknown 
time-series distribution $\mu$. Since $\rho$ gives a conditional 
distribution for every $x_1, \dots, x_n$, it itself defines a time-series 
distribution. Thus, the source of data and the predictor are 
objects of the same kind. Traditionally, one assumes the $x_i$ to be 
independent and identically distributed, or that $\mu$ belongs to 
one of the well-studied parametric families. However, in applications 
involving hard-to-model data sources such as the stock 
market, human-written texts or biological sources, it is often 
assumed, instead, that $\mu$ belongs to some large (nonparametric) 
family of time-series distributions. Examples of such families 
are the set of all finite-memory distributions or the set of all 
stationary distributions. The hope is not that the unknown data 
source under study actually belongs to such a family - for 
example, that a human-written text obeys the finite-memory 
assumption or that the stock market is stationary - but, rather, 
that the considered family of sources is good enough for the 
forecasting task at hand. Such a “hope,” however, remains 
informal, since the theoretical results concern the setting when 
the unknown source belongs to the family. 

Here we consider a formalization of the “good enough for 
prediction” setting proposed in [?]. Specifically, we are asking 
whether a predictor can be constructed which is asymptotically 
consistent (prediction error goes to 0) on any source for which 
a consistent predictor exists in a given family. Thus, given 
a set S of distributions, we consider the set $S^+$ of all 
distributions $\mu$ such that there exists a distribution $\nu \in S$ such 
that the prediction error of $\nu$ on sequences generated by $\mu$ goes 
to zero. We are asking whether there exists a predictor that is 
consistent on all distributions in $S^+$. In this work the family 
S in question is the set of all stationary ergodic distributions, 
and the question is answered in the negative. This negative 
result is rather tight; in particular, the same proof shows that 
the set S can be replaced by the set of all Hidden Markov 
chains with a countable set of states (maintaining the negative 
result), while a consistent predictor exists if we only consider 
Hidden Markov chains with a finite set of states. 

Prior work. A consistent predictor for the set S of all 
finite-alphabet stationary distributions has been constructed 
in [?]. Here the prediction quality is with respect to Cesaro-averaged 
Kullback-Leibler (KL) divergence, which is required 
to converge to 0 either in expectation or with probability 1. 
The same work shows that an analogous result is impossible 
to obtain without Cesaro averaging (the latter negative result 
was obtained independently in the unpublished thesis of Bailey 
[?]). The positive result admits a number of generalizations 
and extensions, including those to continuous alphabets 
[?, ?, ?, ?, ?, ?]. 

Prediction with expert advice (see [?] for an overview) 
presents a different approach to the problem of prediction. 
Here one assumes that the data source to predict is an arbitrary 
deterministic sequence, and makes no further assumptions on 
it. The goal is also different; rather than trying to make the 
prediction error decrease to 0 (which is impossible in this 
setting), it is required to predict as well as any expert from 
a given set. An important difference is that in this setting 
one does not give probability forecasts of the next outcome 
but just deterministic predictions, and the quality is measured 
(according to some loss function) with respect to the prediction 
of each expert. The set of experts is usually small, most 
typically finite; the class of all i.i.d. predictors has also been 
considered [?]. While this approach is very close to the one 
taken here, it does not allow one to look at predictors (experts) 
and data source as objects of the same kind, thus making it 
difficult to formulate our question of interest. 

A connection between the settings was made in the work [?], 
which formulates three problems. The first one is the classical 
problem of constructing a predictor that is asymptotically 
consistent (its error goes to 0) if any process from an (arbitrary, 
given) set S is chosen to generate the data. The second is 
the one considered in this work: asymptotically consistent 
prediction of sequences generated by every source for which 
there is an asymptotically consistent predictor in a given set S. 
The third setting removes the “asymptotically consistent” 
part: it requires constructing a predictor that predicts any 
source whatsoever as well as any predictor in a given set S. 
Thus, the latter formulation is the worst-case analysis akin 
to expert advice (the only difference is that we still try to 
forecast probabilities, rather than individual outcomes). Here 
all predictors and sources are just time-series distributions. The 
three problems are naturally ordered in difficulty: if the set S 
is the same in all the three problems, then any solution to the 
third problem is a solution to the second, and any solution to 
the second is a solution to the first. For the set of all stationary 
processes, it is known since [?] that the first problem admits 
a solution. It is shown in [?] that the third problem (worst-case) 
does not, but the question of whether the second problem 
admits a solution for this set was left open; here we answer 
it in the negative. 

II. Preliminaries 

Let $X$ be a finite set. Since we are after a negative 
result, selecting $X := \{0,1\}$ is not a restriction, so we fix 
this choice. The notation $x_{1..n}$ is used for $x_1, \dots, x_n$. We 
consider time-series distributions, that is, probability measures 
on $\Omega := (X^\infty, \mathcal{B})$, where $\mathcal{B}$ is the sigma-field generated by the 
cylinder sets $[x_{1..n}]$, $x_i \in X$, $n \in \mathbb{N}$, and $[x_{1..n}]$ is the set of 
all infinite sequences that start with $x_{1..n}$. We use $E_\mu$ for the 
expectation with respect to a measure $\mu$. 

For two measures $\mu$ and $\rho$ introduce the expected cumulative 
Kullback-Leibler divergence (KL divergence) as 

$$d_n(\mu, \rho) := E_\mu \sum_{t=1}^{n} \sum_{a \in X} \mu(x_t = a \mid x_{1..t-1}) \log \frac{\mu(x_t = a \mid x_{1..t-1})}{\rho(x_t = a \mid x_{1..t-1})}.$$

In words, we take the expected (over data) cumulative (over 
time) KL divergence between $\mu$- and $\rho$-conditional (on the past 
data) probability distributions of the next outcome. Define also 

$$d(\mu, \rho) := \liminf_{n \to \infty} \frac{1}{n} d_n(\mu, \rho).$$

We say that $\rho$ predicts $\mu$ (in expected average KL divergence) 
if $d(\mu, \rho) = 0$. It is easy to see that 

$$d_n(\mu, \rho) = E_\mu \log \frac{\mu(x_{1..n})}{\rho(x_{1..n})},$$

which makes expected average KL divergence a convenient 
measure of prediction quality to study. 
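
The identity between the cumulative definition of $d_n$ and the log-likelihood ratio can be spelled out in a few lines. The following is a sketch of the standard argument (tower property of conditional expectation followed by the chain rule for probabilities); it assumes that $\rho(x_{1..n}) > 0$ whenever $\mu(x_{1..n}) > 0$, so that all ratios are well defined.

```latex
\begin{align*}
d_n(\mu,\rho)
  &= \sum_{t=1}^{n} E_\mu \Bigl[ \sum_{a \in X} \mu(x_t = a \mid x_{1..t-1})
       \log \frac{\mu(x_t = a \mid x_{1..t-1})}{\rho(x_t = a \mid x_{1..t-1})} \Bigr] \\
  % the inner sum is the conditional expectation, given x_{1..t-1},
  % of the log-ratio evaluated at the symbol x_t actually observed
  &= \sum_{t=1}^{n} E_\mu \Bigl[ \log
       \frac{\mu(x_t \mid x_{1..t-1})}{\rho(x_t \mid x_{1..t-1})} \Bigr] \\
  % the sum of logarithms is the logarithm of the product, and the
  % conditional probabilities multiply to the joint probability
  &= E_\mu \log \prod_{t=1}^{n}
       \frac{\mu(x_t \mid x_{1..t-1})}{\rho(x_t \mid x_{1..t-1})}
   = E_\mu \log \frac{\mu(x_{1..n})}{\rho(x_{1..n})}.
\end{align*}
```

In particular, when $\mu$ is concentrated on a single sequence $x$, the expectation disappears and $d_n$ reduces to $-\log \rho(x_{1..n})$.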

Let the set $\mathcal{P}$ be the set of all time-series distributions over 
$\Omega$. A distribution $\mu \in \mathcal{P}$ is stationary if for every $i, j \in \mathbb{N}$ and 
every $A \in X^j$, we have 

$$\mu(X_{1..j} = A) = \mu(X_{i..i+j-1} = A).$$

A stationary distribution $\mu$ is called ergodic if for all $n \in \mathbb{N}$, 
$A \in X^n$, with probability 1 we have $\lim_{m \to \infty} \mathrm{freq}(A; X_{1..m}) = \mu(A)$, 
where $\mathrm{freq}(A; X_{1..m})$ stands for the frequency of occurrence 
of the word $A$ in $X_{1..m}$. (The latter definition can be 
shown to be equivalent to the usual one formulated in terms 
of shift-invariant sets [?].) 

III. Main result 

Denote by $S \subset \mathcal{P}$ the set of all stationary ergodic time-series 
distributions. Define 

$$S^+ := \{\mu \in \mathcal{P} : \exists \nu \in S \ \ d(\mu, \nu) = 0\}.$$

Theorem 1. For any predictor $\rho \in \mathcal{P}$ there is a measure 
$\mu \in S^+$ such that $d(\mu, \rho) \ge 1$. 

Proof: We will show that the set $S^+$ includes the set $\mathcal{D}$ of 
all Dirac measures, that is, of all measures concentrated on one 
deterministic sequence. The statement of the theorem follows 
directly from this, since for any $\rho$ one can find a sequence 
$x_1, \dots, x_n, \dots \in X^\infty$ such that $\rho(x_n \mid x_{1..n-1}) \le 1/2$ for all 
$n \in \mathbb{N}$. With logarithms taken to base 2, this gives 
$d_n(\delta_x, \rho) = -\log \rho(x_{1..n}) \ge n$ for every $n$, and hence $d(\delta_x, \rho) \ge 1$. 

To show that $\mathcal{D} \subset S^+$, we will construct, for any given 
sequence $x := x_1, \dots, x_n, \dots \in X^\infty$, a measure $\rho_x$ such that 
$d(\delta_x, \rho_x) = 0$, where $\delta_x$ is the Dirac measure concentrated 
on $x$. These measures are constructed as functions of a 
stationary Markov chain with a countably infinite set of states. 
The construction is based on the one used in [?] (see also [?]). 

The Markov chain $M$ has the set $\mathbb{N}$ of states. From each 
state $j$ it transits to the state $j + 1$ with probability $p_j := 
j^2/(j + 1)^2$ and to the state 1 with the remaining probability, 
$1 - p_j$. Thus, $M$ spends most of the time around the state 1, 
but takes rather long runs towards outer states: long, since $p_j$ 
tends to 1 rather fast. We need to show that it does not “run 
away” too much; more precisely, we need to show that $M$ has a 
stationary distribution. For this, it is enough to show that the 
state 1 is positive recurrent (see, e.g., [?, Chapter VIII] for 
the definitions and facts about Markov chains used here). This 
can be verified directly as follows. Denote by $f_1^{(n)}$ the probability 
that starting from the state 1 the chain returns to the state 1 
for the first time in exactly $n$ steps. We have 

$$f_1^{(n)} = (1 - p_n) \prod_{i=1}^{n-1} p_i = \left(1 - \frac{n^2}{(n+1)^2}\right) \frac{1}{n^2}.$$

To show that the state 1 is positive recurrent we need 
$\sum_{n=1}^{\infty} n f_1^{(n)} < \infty$. Indeed, $n f_1^{(n)} < 3/n^2$, which is 
summable. It follows that $M$ has a stationary distribution, 
which we call $\pi$. 

For a given sequence $x := x_1, \dots, x_n, \dots \in X^\infty$, the 
measure $\rho_x$ is constructed as a function $g_x$ of the chain $M$ 
taken with its stationary distribution as the initial one. We 
define $g_x(j) = x_j$ for all $j \in \mathbb{N}$. Since $M$ is stationary, so 
is $\rho_x$. It remains to show that $d(\delta_x, \rho_x) = 0$. Indeed, we have 

$$d_n(\delta_x, \rho_x) = -\log \rho_x(x_1, \dots, x_n) \le -\log \Bigl( \pi_1 \prod_{j=1}^{n} p_j \Bigr) = -\log \pi_1 + 2 \log(n + 1) = o(n).$$
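
The quantities appearing in the proof are easy to check numerically. The sketch below is an illustration rather than part of the argument: it computes the first-return probabilities of the chain from their definition with exact rational arithmetic, checks the bound used for positive recurrence, and evaluates the per-symbol value of the final upper bound. The function names are hypothetical, and the value $\pi_1 = 6/\pi^2$ is not stated in the text: it is an assumption obtained from the standard fact that $\pi_1$ equals one over the mean return time, which for this chain is $\sum_n 1/n^2 = \pi^2/6$.

```python
from fractions import Fraction
import math


def p(j):
    """Probability of moving from state j to state j + 1."""
    return Fraction(j * j, (j + 1) ** 2)


def first_return_probs(n_max):
    """Return [f_1^(1), ..., f_1^(n_max)], computed from the definition."""
    probs = []
    climb = Fraction(1)  # p_1 * ... * p_{n-1}: probability of the run 1 -> 2 -> ... -> n
    for n in range(1, n_max + 1):
        probs.append((1 - p(n)) * climb)  # fall back from state n to state 1
        climb *= p(n)
    return probs


def divergence_bound(n, pi_1):
    """Upper bound -log2(pi_1) + 2*log2(n + 1) on d_n(delta_x, rho_x), in bits."""
    return -math.log2(pi_1) + 2 * math.log2(n + 1)


N = 500
f = first_return_probs(N)

# Total probability of returning within N steps (exact rational number).
total = sum(f)

# Truncated mean return time; it stays bounded as N grows (positive recurrence).
mean_return = float(sum(n * fn for n, fn in enumerate(f, start=1)))

# Assumed value of the stationary probability of state 1 (see the lead-in).
pi_1 = 6 / math.pi ** 2

print("return probability within N steps:", float(total))
print("truncated mean return time:", mean_return)
for n in (10, 1000, 10 ** 6):
    print(n, divergence_bound(n, pi_1) / n)
```

The per-symbol bound printed in the last loop shrinks as $n$ grows, which is the $o(n)$ behaviour the proof relies on.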





A. Other sets of measures to predict and tightness of the result 

One can ask how “tight” the negative result presented is, 
or, in other words, whether the set S was too general a point 
of departure in the first place. 

To answer this question, first note that, as mentioned before, 
the work [?] shows (by an explicit construction) that there 
is a universal predictor for the set S (of stationary ergodic 
distributions) itself, that is, there exists a measure $\rho$ such that 
$d(\mu, \rho) = 0$ for any $\mu \in S$. 

Next, from the proof of Theorem 1 one can see that it is 
possible to replace the set S in its formulation with the set 
of all hidden Markov chains with a countably infinite set of 
states. The latter set is in fact much smaller than the set S. 
Indeed, S can be considered as the set of all stationary hidden 
Markov processes with an uncountably infinite (specifically, 
$X^\infty$) set of states, giving the “much smaller” comparison 
above a precise set-theoretic meaning. 

Passing to positive results for the problem of prediction 
considered, for the set $\mathcal{M}$ of all finite-memory processes, [?, 
Theorem 15] shows that there is a universal predictor for the 
set $\mathcal{M}^+$. Moreover, it is easy to extend the proof of the latter 
result to all hidden Markov processes with finitely many states. 
Thus, it is possible to predict all measures that are predicted 
by a hidden Markov chain with finitely many states, but not 
with a countably infinite set of states, making the negative 
result rather tight. 

B. Other measures of prediction quality 

So far, we have been measuring the quality of prediction 
in terms of expected average KL divergence. Measuring it 
differently would change both the set $S^+$ and the requirement 
on the predictor that would have to predict all measures from 
this set. Thus, the result of Theorem 1 does not directly entail a 
similar statement about either weaker or stronger measures 
of prediction quality. 

However, a quick look at the proof of Theorem 1 shows that 
the construction it employs is rather universal. Specifically, 
$\rho_x$ predicts $x$ also almost surely rather than only in expectation (simply 
because $x$ is a deterministic sequence). 
Moreover, the prediction error (of $\rho_x$ on $x$) converges to 0 
in just about any sense one can think of; for example, one can 
replace KL divergence with the absolute loss, squared loss, 
etc. This implies that Theorem 1 holds for these measures 
of prediction quality as well. Furthermore, this shows that 
the same result holds if we consider different notions of 
prediction on different sides of the question: asking whether it 
is possible to predict in (say) expected average KL divergence 
all measures that are predicted by some stationary ergodic 
measure when (say) the convergence has to be with probability 
1 and there is no time-averaging. 
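
To make the negative side of the statement concrete under more than one loss, the sketch below builds, for a given predictor, a binary sequence on which the predictor assigns probability at most 1/2 to every observed symbol (the step used in the proof of Theorem 1), and accumulates both the logarithmic loss and an absolute loss along it. Everything named here is an assumption made for illustration: the `predictor(past)` interface, the Laplace estimator used as a toy predictor, and the specific form of the absolute loss (one minus the probability assigned to the observed symbol).

```python
import math


def adversarial_sequence(predictor, n):
    """Build x_1..x_n on which `predictor` gives each observed symbol probability <= 1/2.

    `predictor(past)` is a hypothetical interface: it returns the probability
    that the next symbol is 1, given the tuple `past` of earlier symbols.
    Returns the sequence together with its cumulative log loss and absolute loss.
    """
    x, log_loss, abs_loss = [], 0.0, 0.0
    for _ in range(n):
        q1 = predictor(tuple(x))
        symbol = 0 if q1 >= 0.5 else 1        # pick the less likely symbol
        q = q1 if symbol == 1 else 1.0 - q1   # probability the predictor gave it
        # Against a Dirac measure the KL divergence is just the log loss (in bits).
        log_loss += -math.log2(q) if q > 0 else math.inf
        # Absolute loss, in the assumed form 1 - (probability of the observed symbol).
        abs_loss += 1.0 - q
        x.append(symbol)
    return x, log_loss, abs_loss


def laplace(past):
    """Laplace (add-one) estimate of P(next = 1); a toy predictor for the demo."""
    return (sum(past) + 1) / (len(past) + 2)


n = 1000
x, log_loss, abs_loss = adversarial_sequence(laplace, n)
print("average log loss:", log_loss / n)
print("average absolute loss:", abs_loss / n)
```

For any predictor passed in, the average log loss on the constructed sequence is at least 1 bit and the average absolute loss at least 1/2, so neither can converge to 0.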

Thus, the result (placed in the title) appears to be general 
and not an artefact of the measure of prediction quality 
considered. 